
    Operations for Learning with Graphical Models

    This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Well-known examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided, including decomposition, differentiation, and the manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feed-forward networks, and learning Gaussian and discrete Bayesian networks from data. The paper concludes by sketching some implications for data analysis and summarizing how some popular algorithms fall within the framework presented. The main original contributions here are the decomposition techniques and the demonstration that graphical models provide a framework for understanding and developing complex learning algorithms.
    Comment: See http://www.jair.org/ for any accompanying file
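To make the Gibbs sampling schema named in this abstract concrete, here is a minimal sketch that alternately resamples each variable from its conditional distribution given the other. The target (a standard bivariate normal with correlation `rho`), the function names, and the default burn-in are assumptions chosen for illustration; none of them come from the paper.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw samples from a standard bivariate normal with correlation rho.

    Illustrative assumption: for this target, x | y ~ N(rho * y, 1 - rho^2)
    and y | x ~ N(rho * x, 1 - rho^2), so each Gibbs step is a single
    univariate normal draw.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0  # arbitrary starting point
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # resample x given the current y
        y = rng.gauss(rho * x, cond_sd)  # resample y given the new x
        if step >= burn_in:  # discard the early sweeps
            samples.append((x, y))
    return samples


def sample_correlation(samples):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)
```

With enough sweeps, `sample_correlation(gibbs_bivariate_normal(0.8, 20000))` should land close to 0.8. Successive samples are correlated, so the estimate converges more slowly than it would with independent draws.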

    Ab initio calculations of stationary points on the benzene-Ar and p-difluorobenzene-Ar potential energy surfaces: barriers to bound orbiting states

    The potential energy surfaces of the van der Waals complexes benzene–Ar and p-difluorobenzene–Ar have been investigated at the second-order Møller–Plesset (MP2) level of theory with the aug-cc-pVDZ basis set. Calculations were performed with unconstrained geometry optimization for all stationary points. This study has been performed to elucidate the nature of a conflict between experimental results from dispersed fluorescence and velocity map imaging (VMI). The inconsistency is that spectra for levels of p-difluorobenzene–Ar and –Kr below the dissociation thresholds determined by VMI show bands where free p-difluorobenzene emits, suggesting that dissociation is occurring. We proposed that the bands observed in the dispersed fluorescence spectra are due to emission from states in which the rare gas atom orbits the aromatic chromophore; these states are populated by intramolecular vibrational redistribution from the initially excited level [S. M. Bellm, R. J. Moulds, and W. D. Lawrance, J. Chem. Phys. 115, 10709 (2001)]. To test this proposition, stationary points have been located on both the benzene–Ar and p-difluorobenzene–Ar potential energy surfaces (PESs) to determine the barriers to this orbiting motion. Comparison with previous single point CCSD(T) calculations of the benzene–Ar PES has been used to determine the amount by which the barriers are overestimated at the MP2 level. As there is little difference in the comparable regions of the benzene–Ar and p-difluorobenzene–Ar PESs, the overestimation is expected to be similar for p-difluorobenzene–Ar. Allowing for this overestimation gives the barrier to movement of the Ar atom around the pDFB ring via the valley between the H atoms as ≤ 204 cm⁻¹ in S₀ (including zero point energy). From the estimated change upon electronic excitation, the corresponding barrier in S₁ is estimated to be ≤ 225 cm⁻¹. This barrier is less than the 240 cm⁻¹ energy of 30², the vibrational level for which the anomalous "free p-difluorobenzene" bands were observed in dispersed fluorescence from p-difluorobenzene–Ar, supporting our hypothesis for the origin of these bands.
    Rebecca J. Moulds, Mark A. Buntine and Warren D. Lawrance

    The identification of informative genes from multiple datasets with increasing complexity

    Background: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selection of a Bayesian classifier with an appropriate level of complexity by evaluation of predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes.
    Results: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004) since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes.
    Conclusions: We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that highly predictive and consistent genes, from the pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.

    PIDT: A Novel Decision Tree Algorithm Based on Parameterised Impurities and Statistical Pruning Approaches

    In the process of constructing a decision tree, the criteria for selecting the splitting attributes influence the performance of the model produced by the decision tree algorithm. The best-known criteria, such as Shannon entropy and the Gini index, suffer from a lack of adaptability to the datasets. This paper presents novel splitting attribute selection criteria based on some families of parameterised impurities that we propose here to be used in the construction of optimal decision trees. These criteria rely on families of strictly concave functions that define the new generalised parameterised impurity measures, which we applied in devising and implementing our novel PIDT decision tree algorithm. This paper also proposes the S-condition, based on statistical permutation tests, whose purpose is to ensure that the reduction in impurity, or gain, for the selected attribute is statistically significant. We implemented the S-pruning procedure, based on the S-condition, to prevent model overfitting. These methods were evaluated on a number of simulated and benchmark datasets. Experimental results suggest that by tuning the parameters of the impurity measures and by using our S-pruning method, we obtain better decision tree classifiers with the PIDT algorithm.
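For readers unfamiliar with the two classical criteria this abstract names, the sketch below computes Shannon entropy, the Gini index, and the impurity reduction (gain) of a categorical split, and then estimates by a generic permutation test how often a gain at least as large arises by chance. This is not the paper's parameterised impurity family or its S-condition, whose definitions are not given in the abstract. All function names, the number of permutations, and the p-value convention are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_index(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_gain(attribute, labels, impurity=shannon_entropy):
    """Impurity of the parent minus the size-weighted impurity of the children
    obtained by splitting on each distinct value of a categorical attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    children = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - children


def permutation_p_value(attribute, labels, impurity=shannon_entropy,
                        n_permutations=1000, seed=0):
    """Estimate how often shuffled labels give a gain at least as large as the
    observed one (hypothetical helper, not the paper's S-condition)."""
    rng = random.Random(seed)
    observed = impurity_gain(attribute, labels, impurity)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any attribute/label association
        if impurity_gain(attribute, shuffled, impurity) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

A split would be kept only if the returned p-value falls below a chosen significance level. Passing `impurity=gini_index` switches the criterion without changing the test.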

    Incorporating Social Context and Domain Knowledge for Entity Recognition

    Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedical domains. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most of the documents are created in a social context by common authors via social interactions, such as reply and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances? In this paper, we propose the SOCINST model to formalize the problem into a probabilistic model. Given a set of short document

    Optimal constraint-based decision tree induction from itemset lattices

    In this article we show that there is a strong connection between decision tree learning and local pattern mining. This connection allows us to solve the computationally hard problem of finding optimal decision trees in a wide range of applications by post-processing a set of patterns: we use local patterns to construct a global model. We exploit the connection between constraints in pattern mining and constraints in decision tree induction to develop a framework for categorizing decision tree mining constraints. This framework allows us to determine which model constraints can be pushed deeply into the pattern mining process, and allows us to improve the state of the art in optimal decision tree induction.